Abstract

Validating models of airspace operations is a particu- 
lar challenge. These models are often aimed at finding 
and exploring safety violations, and aim to be accurate 
representations of real-world behavior. However, the 
rules governing the behavior are quite complex: non- 
linear physics, operational modes, human behavior, and 
stochastic environmental concerns all determine the re- 
sponses of the system. In this paper, we present a study 
on aircraft runway approaches as modeled in Geor- 
gia Tech’s Work Models that Compute (WMC) simu- 
lation. We use a new learner, Genetic-Active Learning 
for Search-Based Software Engineering (GALE) to dis- 
cover the Pareto frontiers defined by cognitive struc- 
tures. These cognitive structures organize the prioriti- 
zation and assignment of tasks of each pilot during ap- 
proaches. We discuss the benefits of our approach, and 
also discuss future work necessary to enable uncertainty 
quantification. 

The Motivation — Complexity in Aerospace 

Complexity that works is built of modules that work 
perfectly, layered one over the other. -Kevin Kelly 

The National Airspace System (NAS) is complex. Each 
airplane is an intricate piece of machinery with both me- 
chanical and electrical linkages between its many com- 
ponents. Engineers and operators must constantly decide 
which components and interactions within the airplane can 
be neglected. As one example, the algorithms that control 
the heading of aircraft are usually based on linearized ver- 
sions of the actual (very nonlinear) dynamics of the aircraft 
in its environment. (Blakelock 1991) Each airplane must 
also interact with other airplanes and the environment. For 
instance, weather can cause simple disruptions to the flow 
of the airspace, or be a contributing factor to major disas- 
ters. (NTSB 2010) Major research efforts are currently fo- 
cused on models and software to mitigate weather-based 
risks. (Le Ny and Balakrishnan 2010) 

The glue for these interacting airspace systems consists 
primarily of people. Pilots and air traffic controllers are the 
final arbiters and the primary adaptive elements; they are
expected to compensate for weather, for mechanical failures,
and for others’ operational mistakes. They are also the
scapegoats. Illustratively, examine the failure of a software 
system at the Los Angeles Air Route Traffic Control Cen- 
ter on September 14, 2004. (Geppert 2004) Voice communi- 
cations ceased between the controllers and the 400 aircraft 
flying above 13,000 feet over Southern California and adja- 
cent states. During the software malfunction, there were five 
near-misses between aircraft, with collisions prevented only 
by an on-board collision detection and resolution system 
(TCAS). At the time, the FAA was in the process of patch- 
ing the systems. As often happens in software-intensive sys- 
tems, the intermediate ‘fix’ was to work around the problem 
in operations — the software system was supposed to be re- 
booted every 30 days in order to prevent the occurrence of 
the bug. The human operators hadn’t restarted the system, 
and they were blamed for the incident. 

If the current state of airspace complexity causes palpita- 
tions, experts considering what might happen in the planned 
next generation (NextGen) airspace can be excused for full- 
fledged anxiety attacks. The future of the NAS is more het- 
erogeneity and more distribution of responsibility. We are 
seeing a switch to a best-equipped best-served model — 
airlines who can afford to buy and operate equipment can 
get different treatment in the airspace. One example is the 
advent of Required Navigation Performance (RNP) routes, 
in which aircraft fly tightly-controlled four-dimensional tra- 
jectories by utilizing GPS data. With GPS, an aircraft can 
be cleared to land and descend from altitude to the runway
in a Continuous Descent Arrival (CDA): these approaches 
save fuel and allow for better-predicted arrival times. How- 
ever, at airports with these approved routes, controllers must 
work with mixed traffic — airplanes flying CDA routes and 
airplanes flying traditional approaches. In the future, the 
airspace will also include Unmanned Aerial Systems (fully- 
autonomous systems, and also systems in which a pilot flies 
multiple aircraft from the ground), and a wider performance 
band for civil aircraft. 

The overall traffic increase is leading to software-based 
decision support for pilots and controllers. There is an ac- 
tive (and sometimes heated) discussion about just how much 
authority and autonomy should remain with people versus 
being implemented in software. Deciding where loci of control
should reside in the airspace is an example of a wicked design
problem (Rittel 1984; Hooey and Foyle 2007), as evidenced by
the following criteria:

• Stakeholders disagree on the problem to solve. 

• There are no clear termination rules. 

• There are ‘better’ or ‘worse’ solutions, but not ‘right’ and 
‘wrong’ solutions. 

• There is no objective measure of success. 

• The comparison of design solutions requires iteration. 

• Alternative solutions must be discovered. 

• The level of abstraction that is appropriate for defining the 
problem requires complex judgments. 

• It has strong moral, political, or professional dimensions 
that cannot be easily formalized. 

In this paper we will first discuss a simulation that is de- 
signed to study human-automation interaction for CDAs. We 
will then overview the current state-of-the-art for uncertainty 
quantification within this type of complex system, and focus 
on techniques for exploring the Pareto Frontier. In the next 
section, we will explain and demonstrate a new technique 
(GALE) for quickly finding the Pareto Frontier for ‘wicked’ 
problems like those we study in our use case. We will con- 
clude by overviewing our future plans. 

How Do Pilots and Air Traffic Controllers 
Land Planes? Our Test Case and Its Inputs 

In this paper, we use the CDA scenario within the Georgia
Institute of Technology’s Work Models that Compute (WMC).
WMC is being used to study concepts of operation within the
NAS, including the work that must be performed, the cognitive
models of the agents (both humans and computers) that will
perform the work, and the underlying nonlinear dynamics of
flight. (Kim 2011; Pritchett, Christmann, and Bigelow 2011;
Feigh, Dorneich, and Hayes 2012) WMC and the NAS are hybrid
systems, governed both by continuous dynamics (the underlying
physics that allows flight) and by discrete events (the
controllers’ choices and aircraft modes). (Pritchett, Lee, and
Goldsman 2001) Hybrid systems are notoriously difficult to
analyze, for reasons we will overview in the next section.

WMC’s cognitive models are multi-level and hierarchi- 
cal, (Kim 2011) with: 

• Mission Goals at the highest level, such as Fly and Land 
Safely, that are broken into 

• Priority and Values functions such as Managing Interac- 
tion with the Air Traffic System. These functions can be 
decomposed into 

• Generalized Functions such as Managing the Trajectory, 
that can still be broken down further into 

• Temporal Functions such as Controlling Waypoints. 

In this paper, we present some preliminary results in 
which we have varied four input parameters to WMC in or- 
der to explore their effects on the simulation’s behavior. We 
list these parameters in bold, and then describe them. The 
Scenario is a variable within the CDA simulation with four 
values. In the nominal scenario, the aircraft follows the ideal 
case of arrival and approach, exactly according to printed 
charts, with no wind. In the late descent scenario, the air
traffic controller delays the initial descent, forcing the pilots
to quickly descend in order to ‘catch up’ to the ideal de- 
scent profile. In the third, unpredicted rerouting, scenario, 
the air traffic controller directs the pilot to a waypoint that is 
not on the arrival charts, and from there returns the pilot to 
the expected route. In the final scenario, the simulation cre- 
ates a tailwind that the pilot and the flight deck automation 
must compensate for in order to maintain the correct trajec- 
tory. The late descent, unpredicted rerouting, and tailwind 
scenarios all have further variants, modifying the times at 
which the descent is cleared, the waypoint that the plane is 
routed to, and the strength of the tailwind, respectively. 

Function Allocation is a variable that describes differ- 
ent strategies for configuring the autoflight control mode, 
and has four different possible settings. A pilot may have 
access to guidance devices for the lateral parts (LNAV) and
the vertical parts (VNAV) of the plane’s approach path. Civil 
transport pilots are likely to have access to a Flight Manage- 
ment System (FMS), a computer that automates many avi- 
ation tasks. In the first Function Allocation setting, which 
is highly automated, the pilot uses LNAV/VNAV, and the 
flight deck automation is responsible for processing the air 
traffic instructions. In the second, mostly automated, setting, 
the pilot uses LNAV/VNAV, but the pilot is responsible for 
processing the air traffic instructions and for programming 
the autoflight system. In the third setting, the pilot receives 
and processes the air traffic instructions. The pilot updates 
the vertical autoflight targets; the FMS commands the lat- 
eral autoflight targets. This setting is the mixed-automated 
function allocation setting. In the final, mostly manual, set- 
ting, the pilot receives and processes air traffic instructions, 
and programs all of the autoflight targets. 

The third parameter we are varying in this paper is the pi- 
lots’ Cognitive Control Modes. There are three cognitive 
control modes implemented within WMC: opportunistic, 
tactical, and strategic. In the opportunistic cognitive control 
mode, the pilot does only the most critical temporal func- 
tions: the actions “monitor altitude” and “monitor airspeed.” 
The values returned from the altitude and the airspeed will 
create tasks (like deploying flaps) that the pilot will then per- 
form. In the tactical cognitive control mode, the pilot cycles 
periodically through most of the available monitoring tasks 
within WMC, including the confirmation of some tasks as- 
signed to the automation. Finally, in the strategic mode, the 
pilot monitors all of the tasks available within WMC and 
also tries to anticipate future states. This “anticipation” is 
implemented as an increase in the frequency of monitoring, 
and also a targeted calculation for future times of interest. 

Finally, the fourth variable we explore in this paper is 
Maximum Human Taskload: the maximum number of 
tasks that can be requested of a person at one time. In 
previous explorations using WMC (Kim 2011), the author 
chose three different levels: tight, in which the maximum 
number of tasks that can be requested of a person at one 
time is 3; moderate, in which that value is 7; and unlim- 
ited, in which a person is assumed to be able to handle up 
to 50 requested tasks at one time. WMC uses a task model 
in which tasks have priorities and can be active, delayed, 
or interrupted. (Feigh and Pritchett 2013) If a new task is
passed to a person and that person’s maximum taskload
has been reached, an active task will be delayed or interrupted,
depending on the relative priorities of the tasks that
have been assigned. Delayed and interrupted actions may be
forgotten according to a probability function that grows with
the time elapsed since the task was active. For the studies
in this paper, we assume that people can handle between 
1 and 7 tasks at maximum. (Miller 1956; Cowan 2000; 
Tarnow 2010) 
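
The taskload cap and the forgetting behavior described above can be illustrated with a minimal sketch. Everything named here is hypothetical and is not WMC code: the `Task` class, the `assign` helper, the rule that the lowest-priority task is the one delayed, and the exponential form of `forget_probability` are all assumptions chosen only to show the mechanism.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    priority: int           # assumed convention: larger means more important
    state: str = "active"   # "active", "delayed", or "interrupted"


def forget_probability(elapsed_s: float, rate: float = 0.01) -> float:
    """Hypothetical forgetting curve that grows with elapsed time (not WMC's)."""
    return 1.0 - math.exp(-rate * max(elapsed_s, 0.0))


def assign(active: List[Task], new: Task, max_taskload: int) -> Optional[Task]:
    """Add `new`; if the cap is exceeded, delay the lowest-priority task.

    Returns the delayed task, or None when there was still room.
    """
    active.append(new)
    if len(active) <= max_taskload:
        return None
    bumped = min(active, key=lambda t: t.priority)
    active.remove(bumped)
    bumped.state = "delayed"
    return bumped


pilot = [Task("monitor altitude", 3), Task("monitor airspeed", 2)]
assign(pilot, Task("deploy flaps", 5), max_taskload=3)       # fits under the cap
bumped = assign(pilot, Task("radio call", 1), max_taskload=3)
print(bumped.name, bumped.state)
print(round(forget_probability(60.0), 3))
```

In this sketch a newly arriving low-priority task can itself be the one that is delayed; whether WMC behaves this way is not stated in the text.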

Our analysis seeks to explore the effects that each of
the four variables above has on the following five outputs:
the number of forgotten tasks in the simulation
(NumForgottenTasks), the number of delayed actions
(NumDelayedActions), the number of interrupted actions
(NumInterruptedActions), the total time of all of the delays
(DelayedTime), and the total time taken to deal with
interruptions (InterruptedTime). In our results, we refer to
these outputs as (f1, ..., f5) and average each of their values
across the pilot and the copilot.

In Kim’s dissertation (Kim 2011), she primarily studies 
function allocation and its effect on eight different parame- 
ters, including workload and mission performance. In this 
sense, the WMC model by itself as Kim chose to use it 
(much less the airspace it is meant to simulate) is ‘wicked’. 
In particular, there is no single measure of success, and there 
is no agreement as to which of the measures is more im- 
portant. Kim analyzed all of the combinations of the above 
four variables, and manually postprocessed the data in or- 
der to reach significant conclusions about how the level-of- 
automation affects each of her eight metrics. 
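
As a rough illustration of what analyzing all of the combinations involves, the four input variables can be enumerated as a full factorial grid. This is a sketch only: the Python names and string labels are ours, not WMC identifiers, and the taskload variable uses the three levels (tight, moderate, unlimited) from the earlier study.

```python
from itertools import product

# Labels taken from the descriptions above; the identifiers are hypothetical.
SCENARIOS = ["nominal", "late descent", "unpredicted rerouting", "tailwind"]
FUNCTION_ALLOCATIONS = [
    "highly automated",
    "mostly automated",
    "mixed-automated",
    "mostly manual",
]
COGNITIVE_CONTROL_MODES = ["opportunistic", "tactical", "strategic"]
MAX_TASKLOADS = {"tight": 3, "moderate": 7, "unlimited": 50}


def input_grid():
    """Every combination of the four input variables, one tuple per run."""
    return list(
        product(
            SCENARIOS,
            FUNCTION_ALLOCATIONS,
            COGNITIVE_CONTROL_MODES,
            MAX_TASKLOADS,  # iterating a dict yields its keys
        )
    )


grid = input_grid()
print(len(grid))  # 4 * 4 * 3 * 3 = 144
```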

An Overview of Uncertainty Quantification 
Within Hybrid, Wicked Systems 

Remember that all models are wrong; the practical
question is how wrong do they have to be to not be useful.
-George E.P. Box and Norman R. Draper

Validation is the process by which analysts answer “Did 
we solve the right problem?” Uncertainty (and risk) quantifi- 
cation is core to the validation of safety-critical systems, and 
is particularly difficult for wicked design problems. WMC is 
a tool that is aimed at validating concepts of operation in the 
airspace. It abstracts some components within the airspace, 
and approximates other components, and must itself be vali- 
dated in order to understand its predictive strengths and lim- 
itations. Validation efforts can take as a given that WMC’s 
predictions are useful, and be focused on discovering the 
risks in the concepts of operation (in which case the anal- 
ysis is usually called risk quantification). Uncertainty quan- 
tification within the model is usually focused on comparing 
the predictions to those we get (or to those we expect to get) 
in reality. The questions we are asking in each of these two 
cases are different, but the underlying tools we use in order
to analyze them are often the same.

In the case of risk quantification, where we want to validate
the concept of operation, we explore the input and output
spaces of our models, looking for those that perform better
or worse among the many metrics we’ve chosen to examine.
For simulations with long response times, or for which
we hope to learn about a broad class of behaviors using
relatively few trials, we build a secondary model that is easier to
evaluate than the original simulation. Whichever surface we
can evaluate, whether it is the original or a secondary model,
is called a response surface. In the case of uncertainty
quantification, where we want to validate our model, we again
build a response surface for our model and compare this
against the response surface built using real (or expected)
behaviors.

A common way of characterizing a response surface is by 
building a Pareto Frontier. A Pareto Frontier occurs when a 
system has competing goals and resources; it is the boundary 
where it is impossible to improve on one metric without de- 
creasing another. (Lotov, Bushenkov, and Kamenev 2004) A 
Pareto Frontier is usually discovered using an optimization 
methodology. In rare cases, it may be possible to analytically 
discover the Pareto Frontier — this is unlikely in wicked de- 
sign problems like those we are studying here. More often, 
we use a learning technique to discover the Pareto Frontier 
given concrete trials of the system. 
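
The definition of a Pareto Frontier given above can be made concrete with a small sketch. It assumes that every objective is to be minimized and that each trial is summarized as a tuple of objective values; the function names `dominates` and `pareto_frontier` are ours, and the quadratic-time filter is for illustration rather than efficiency.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_frontier(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative two-objective trials (made-up values, lower is better).
trials = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
print(pareto_frontier(trials))
```

On the frontier, improving one objective requires giving up some of another: moving from (2, 2) to (1, 5) improves the first objective only by worsening the second.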

Classical optimization techniques are often founded on 
the idea that the response surface and its first derivative are 
Lipschitz continuous everywhere (smooth). For smooth surfaces,
it is possible to find a response surface that is arbitrarily
close to our desired function using polynomial approximations
by the Weierstrass Approximation Theorem. (Bartle 1976)
For the hybrid, complex, non-linear problems we
are studying here, no such guarantee of smoothness exists. 
Modal variables like the cognitive control modes in WMC 
usually require combinatorial approaches in order to ex- 
plore. For other WMC inputs, such as the maximum hu- 
man taskload, a domain expert might reasonably suspect that 
there is an underlying smooth behavior. For some WMC in- 
puts we haven’t modeled yet, such as flight characteristics 
of the aircraft or the magnitude of a tailwind, there is al- 
most certainly a smooth relationship, but it may be nonlin- 
ear. Classical techniques handle the mix of discrete and con- 
tinuous inputs by solving a combinatorial number (in the dis- 
crete inputs) of optimization problems over the continuous 
inputs, and then comparing the results across the optimiza- 
tions in a post-processing step. (Gill, Murray, and Wright 
1986) This technique can be computationally very expen- 
sive, especially when you consider that continuous optimiza- 
tion techniques are sensitive to local minima (in our nonlin- 
ear aerospace problems), and several different input trials 
should be performed. 

Statistical techniques such as Treed Gaussian Processes
and Classification Treed Gaussian Processes can be used to
build statistical emulators as the response surfaces for
simulators, and have the advantage that they can model
discontinuities and locally smooth regions. (Gramacy 2007;
He 2012) As a disadvantage, they are limited by computational
complexity to relatively few inputs (10s but not 100s).
More recent techniques, such as those based on particle
filters, can handle significantly more inputs. (He and
Davies 2013)

All of the above techniques have the limitation that they
optimize for one single best value. To optimize across several
criteria (such as the five we analyze for this paper or
the eight in Kim’s thesis) using the above techniques, the
analyst usually needs to build a penalty function, a formula that
is strictly monotonic in improvement across the desired metrics
and weights each metric according to its relative value.
In this paper, we choose instead to explore the class of
multi-objective response surface methods, as detailed in the next
section.
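
To see why a penalty function is an awkward fit for wicked problems, consider the simplest form it could take: a weighted sum of the objectives. The sketch below is illustrative only; the function name, the two candidate solutions, and the weights are all made up.

```python
def penalty(objectives, weights):
    """Weighted sum of objective values; lower is better."""
    return sum(w * f for w, f in zip(weights, objectives))


# Two hypothetical candidates scored on two objectives (lower is better).
a = (1.0, 4.0)
b = (3.0, 1.0)

equal = (1.0, 1.0)    # both objectives valued the same
skewed = (3.0, 1.0)   # first objective valued three times as much

print(penalty(a, equal), penalty(b, equal))
print(penalty(a, skewed), penalty(b, skewed))
```

The preferred candidate flips when the weights change, so the result depends on a weighting that stakeholders in a wicked problem do not agree on.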

GALE: Active Learning for Wicked Problems 

Wicked problems have many features; the most important 
being that no objective measure of success exists. Designing 
solutions for wicked problems cannot aim to produce some 
perfectly correct answer since no such definition of correct 
exists. Hence, this approach to design tries to support effec- 
tive debates by a community over a range of possible an- 
swers. For example, different stakeholders might first elabo- 
rate their own preferred version of the final product or what 
is important about the current problem. These preferred ver- 
sions are then explored and assessed. 

The issue here is that there are very many preferred ver- 
sions. For example, consider the models discussed in this pa- 
per. Just using the current models, as implemented by Kim 
et al. (Kim 2011), the input space can be divided 144 ways,
each of which requires a separate simulation. In our explo- 
ration, we further subdivide the maximum human taskload 
to evaluate 252 combinations. Worse yet, a detailed reading 
of Kim’s thesis shows that her 144 input sets actually ex- 
plore only one variant each for three of her inputs. Other 
modes would need to be explored to handle: 

• Unpredictable rerouting; 

• Different tail wind conditions; 

• Increasing levels of delay. 

If we give three “what-if” values to the above three items
then, taken together, these 3*3*3*252 modes*inputs would
require nearly 7,000 different simulations¹. This is an issue
since, using standard multi-objective optimizers such as
NSGA-II (Deb et al. 2000), our models take seven hours 
to reach stable minima. Hence, using standard technology, 
these 7,000 runs would take 292 weeks to complete. In prin- 
ciple, such long simulations can be executed on modern 
CPU clusters. For example, using the NASA Ames multi- 
core supercomputers, the authors once accessed 30 weeks of 
CPU in a single week. Assuming access to the same hard- 
ware, our 7,000 runs might be completed in under ten weeks. 
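
The arithmetic behind these estimates is short enough to check directly. The sketch below only restates the figures quoted above (252 combinations, three what-if values for each of three items, seven hours per optimization, 30 CPU-weeks per calendar week); the variable names are ours.

```python
BASE_COMBINATIONS = 252        # input combinations explored in this paper
WHAT_IF_VALUES = 3             # values tried for each extra item
WHAT_IF_ITEMS = 3              # rerouting, tailwind, delay
HOURS_PER_RUN = 7              # time for a standard optimizer to stabilize
HOURS_PER_WEEK = 24 * 7
CLUSTER_WEEKS_PER_WEEK = 30    # CPU-weeks obtained per calendar week

exact_runs = BASE_COMBINATIONS * WHAT_IF_VALUES ** WHAT_IF_ITEMS
rounded_runs = 7000            # "nearly 7000" in the text

serial_weeks = rounded_runs * HOURS_PER_RUN / HOURS_PER_WEEK
cluster_weeks = serial_weeks / CLUSTER_WEEKS_PER_WEEK

print(exact_runs)                  # 3 * 3 * 3 * 252
print(round(serial_weeks))         # weeks on a single CPU
print(round(cluster_weeks, 1))     # weeks with the cluster allocation
```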

The problem here is that hardware may not be available.
The example in the last paragraph (where 30 weeks of CPU
time was accessed in one week) was only possible since
there was a high priority issue in need of urgent resolution.
In the usual case at NASA, researchers can only access a
small fraction of that CPU. For example, if there has been
some incident on a manned space mission, then NASA enlists
all available CPU time for “damage modeling” (which
is a large series of “what-if” queries that assess the potential
impact of some event). At those times, researchers can
access zero CPU for any other purpose.

¹ To be accurate, there are many more than 7,000 possible
simulations, especially if we start exploring fine-grained divisions of
continuous space. Regardless of whether or not we need 7,000 or
7,000,000 simulations, the general point of this section still holds;
i.e. wicked problems need some way to explore more options faster.

GALE, short for Geometric Active Learning Evolution, 
combines spectral learning and response surface methods 
to reduce the number of evaluations needed to assess a set 
of candidate solutions. The algorithm is an active learner; 
i.e. instead of evaluating all instances, it isolates and explores
only the most informative ones. Hence, we recommend
GALE for simulations of wicked problems. The following
notes are a brief overview of GALE. For full details,
see (Krall 2014; Krall and Menzies 2014).

Response surface methods (RSM) generate multiple small 
approximations to different regions of the output space. 
Multi-objective RSMs explore the Pareto frontier (the space 
of all solutions dominated by no other). These approxima- 
tions allow for an extrapolation between known members of 
the population and can be used to generate approximations 
to the objective scores of proposed solutions (so after, say, 
100 evaluations it becomes possible to quickly approximate 
the results of, say, 1000 more). Other multi-objective RSMs 
make parametric assumptions about the nature of that sur- 
face (e.g. Zuluaga et al. assume they can be represented as 
Gaussian process models (Zuluaga et al. 2013)). GALE uses 
non-parametric multi-objective RSMs so it can handle mod- 
els with both discrete and continuous variables. 

GALE builds its response surface from clusters on the
Pareto frontier. These are found via a recursive division of
individuals along the principal component found at each
level of the recursion². Spectral learners like GALE base
their reasoning on these eigenvectors since they simultaneously
combine the influences of important dimensions while
reducing the influence of irrelevant, redundant, or noisy
dimensions (Kamvar, Klein, and Manning 2003). Recursion
on n individuals stops when a division has fewer than $\sqrt{n}$
members. At termination, this procedure returns a set of leaf
clusters that it calls best.

During recursion, GALE evaluates and measures objective
scores for a small number m of individuals. These
scores are used to check for domination between two partitions
of individuals, divided at some middle point (chosen
to minimize the expected variance over each partition)
of that level’s component. GALE then only recurses on the
non-dominated half. That is, the best individuals found by
GALE are clusters along the Pareto frontier.

GALE is an active learner. During its recursion, when exploring
n randomly generated solutions, GALE evaluates
at most $m \log_2(n)$ individuals. One surprising result
from our experiments is that GALE needs to check for
domination on only the m = 2 most separated individuals
along the principal component (which is consistent with
Pearson’s original claim that these principal components are
an informative method of analyzing data (Pearson 1901)).
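
The recursion described in the last three paragraphs can be sketched as follows. This is not GALE's implementation: the names `best_leaves` and `sort_split` are ours, the split is a plain sort-and-halve on one-dimensional toy individuals rather than a split along a principal component, and descending into both halves when neither pole dominates is an assumption, since the text does not say how that case is handled.

```python
import math


def dominates(a, b):
    """Minimizing: a is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def best_leaves(items, split, evaluate, min_size, counter):
    """Recursively halve `items`, evaluating only the two poles per level."""
    if len(items) < max(min_size, 2):
        return [items]
    west, east, west_half, east_half = split(items)
    west_score, east_score = evaluate(west), evaluate(east)
    counter["evals"] += 2                      # m = 2 evaluations per level
    leaves = []
    if not dominates(east_score, west_score):  # west half is not dominated
        leaves += best_leaves(west_half, split, evaluate, min_size, counter)
    if not dominates(west_score, east_score):  # east half is not dominated
        leaves += best_leaves(east_half, split, evaluate, min_size, counter)
    return leaves


def sort_split(items):
    """Toy stand-in for a principal-component split: sort, then halve."""
    ordered = sorted(items)
    mid = len(ordered) // 2
    return ordered[0], ordered[-1], ordered[:mid], ordered[mid:]


population = list(range(64))                   # toy one-dimensional individuals
counter = {"evals": 0}
leaves = best_leaves(
    population,
    sort_split,
    lambda x: (abs(x - 60),),                  # toy single objective
    math.sqrt(len(population)),                # stop below sqrt(n) members
    counter,
)
print(leaves, counter["evals"], 2 * math.log2(len(population)))
```

In this toy run one pole always dominates the other, so only one half survives at each level and the evaluation count stays within $m \log_2(n)$. When neither pole dominates, the sketch explores both halves and the count can exceed that figure.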

² The principal component of a set of vectors shows the general
direction of all the vectors (Pearson 1901).

                                                    s percentiles
            num eval    f1    f2    f3    f4    f5    50th   (75-25)th
GALE              33   0.8   0.0   0.0   0.0   0.0     82%          0%
NSGA-II        4,000   1.2   0.0   0.0   0.0   0.0     84%          0%
SPEA2          3,200   1.0   0.0   0.0   0.0   0.0     84%          1%
Baseline           -   8.2   0.1   0.1   0.2   0.1    100%          0%

Table 1: Raw values for f1, f2, f3, f4, f5 = NumForgottenTasks,
NumDelayedActions, NumInterruptedActions, DelayedTime,
InterruptedTime, respectively. Lower is better.

For reasons of speed, GALE uses a Nyström technique
(called FASTMAP) to find the principal component (Faloutsos
and Lin 1995; Platt 2005). At each level of its recursion,
this technique finds in linear time the poles p, q (individuals
that are furthest apart) and the approximation to the principal
component is the line from p to q. GALE handles continuous
and discrete variables by adopting the distance function of
Aha et al., which can manage continuous and discrete variables
(Aha, Kibler, and Albert 1991).
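
The pole-finding step can be sketched in a few lines. The sketch assumes a distance in the style of Aha et al. (range-normalized squared differences for numeric fields, a 0/1 mismatch for symbolic fields) and the usual FASTMAP heuristic of two linear passes; the function names, the `ranges` argument, and the toy individuals are ours, not GALE's.

```python
import math
import random


def distance(a, b, ranges):
    """Mixed distance: numeric fields are range-normalized, symbolic are 0/1."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:                       # symbolic (discrete) field
            total += 0.0 if x == y else 1.0
        else:                               # numeric field with (lo, hi) range
            lo, hi = r
            total += ((x - y) / (hi - lo)) ** 2 if hi > lo else 0.0
    return math.sqrt(total)


def find_poles(items, ranges, rng):
    """Two linear passes: far from a random anchor, then far from that."""
    anchor = rng.choice(items)
    p = max(items, key=lambda z: distance(anchor, z, ranges))
    q = max(items, key=lambda z: distance(p, z, ranges))
    return p, q


def project(z, p, q, ranges):
    """Position of z along the line from p to q, via the cosine rule."""
    c = distance(p, q, ranges)
    if c == 0:
        return 0.0
    a, b = distance(p, z, ranges), distance(q, z, ranges)
    return (a * a + c * c - b * b) / (2 * c)


# Toy individuals: one continuous decision and one discrete decision.
items = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (10.0, "b")]
ranges = [(0.0, 10.0), None]

p, q = find_poles(items, ranges, random.Random(1))
positions = sorted(items, key=lambda z: project(z, p, q, ranges))
print(p, q, positions)
```

Sorting individuals by their projection gives the ordering along the approximate principal component, which is what a division at some middle point would then cut.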

Ostrouchov and Samatova show that the poles found by 
FASTMAP are approximations to the vertexes of the con- 
vex hull around the individuals (Ostrouchov and Samatova 
2005). Therefore, we can use FASTMAP as a response sur- 
face method by extrapolating between the poles of the best 
clusters. Given some initial set of individuals, GALE de- 
fines the baseline to be the median value of all their ob- 
jectives. (Note that this baseline and initial population are 
generated only once, and then cached for reuse.) For each 
cluster c_i ∈ best and for each pole (p, q) ∈ c_i, GALE sorts
the poles by their score (denoted s) where s is the sum of
the distance of each objective from the baseline (and better
scores are lower)³. The best individuals in leaf clusters are
mutated towards their better pole by an amount

$$\forall d \in D,\quad d = d + \ldots\ s(q)$$

where q is the pole with better (and lowest) score, D are
the decisions within an individual, and $0 < \Delta < 1$ is a
random variable. GALE grows new solutions using ranges 
in the mutated population. Numbers are discretized into ten 
ranges using (x − min) / ((max − min) / 10). The most frequent
range is then found for each feature and new individuals are 
generated by selecting values at random from within those 
ranges. GALE then recurses on the new individuals. 
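
The discretize-and-regrow step can be sketched as below. The sketch assumes purely numeric decisions and samples uniformly inside the chosen range; the helper names (`bin_index`, `most_frequent_range`, `grow`) and the toy data are ours, and how GALE treats discrete decisions or ties between ranges is not covered.

```python
import random
from collections import Counter


def bin_index(x, lo, hi, bins=10):
    """Index of the range holding x, using (x - min) / ((max - min) / bins)."""
    if hi == lo:
        return 0
    return min(int((x - lo) / ((hi - lo) / bins)), bins - 1)


def most_frequent_range(values, bins=10):
    """Bounds (low, high) of the range that holds the most values."""
    lo, hi = min(values), max(values)
    counts = Counter(bin_index(v, lo, hi, bins) for v in values)
    best = counts.most_common(1)[0][0]
    width = (hi - lo) / bins
    return lo + best * width, lo + (best + 1) * width


def grow(population, n_new, rng):
    """New individuals drawn from each feature's most frequent range."""
    ranges = [most_frequent_range(col) for col in zip(*population)]
    return [tuple(rng.uniform(a, b) for a, b in ranges) for _ in range(n_new)]


# Toy mutated population with two numeric decisions per individual.
mutated = [(0.0, 2.0), (1.0, 2.1), (5.1, 2.2), (5.2, 9.0), (5.3, 12.0), (10.0, 2.3)]
feature_ranges = [most_frequent_range(col) for col in zip(*mutated)]
children = grow(mutated, 5, random.Random(0))
print(feature_ranges)
print(children)
```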

GALE’s performance has been compared to two stan- 
dard MOEAs (NSGA-II and SPEA2 (Deb et al. 2000; 
Zitzler, Laumanns, and Thiele 2001)) on (a) a software pro- 
cess model of agile development (Lemon et al. 2009), as 
well as (b) a sample of the standard optimization certifica- 
tion problems (Krall 2014; Krall and Menzies 2014). In that 
study, GALE terminated using 20 to 89 times fewer evaluations.
Further, its solutions were usually as good as or better
than those of NSGA-II or SPEA2. The conclusion from that
study was that GALE’s RSM was a better guide for mutation 
than the random search of NSGA-II or SPEA2. 

Results 

We ran three optimization algorithms (GALE, NSGA-II, and
SPEA2) on CDA, and show their results in Table 1. The
three algorithm rows of this table indicate the final objective
scores of CDA in the f-columns. The Baseline row (the
same p = 100 sized population is used for each algorithm)
indicates the starting points for those objectives. These
algorithm rows should be compared to the baseline row, and the
improvements are clear: for example, NumForgottenTasks (f1)
falls from 8.2 in the baseline to between 0.8 and 1.2.

³ To be precise, s is the “loss” measure discussed by (Krall and
Menzies 2014), as inspired by (Zitzler and Künzli 2004).

Figure 1: Visualizations of %loss from baseline of objective
scores. The number of evaluations is shown on the horizontal
axis. Y-axis values show objectives achieved expressed as a
percentage of values seen in the baseline; e.g. y = 50 means that
an optimizer has achieved some objective value that is half
that seen in the baseline. Shown as red, blue, green lines is
the lowest seen objective score at that particular value along
the X-axis; lower values are better.

The previous paragraph addresses the validity of CDA,
i.e. whether it can be optimized. Now, we turn to which optimizer
performs best. When comparing optimizers, we need to compare 1)
how well, and 2) how fast. In Figure 1, colored lines represent
the best seen improvements of each algorithm (lower is
better). In general, the algorithms are evenly matched, except for
interrupted time (f5), where the red line clearly outperforms the
blue and green. Thus, GALE is better on “how well”.

As for “how fast”, we return to Table 1 to compare the 
“num eval” of each algorithm. Each evaluation is equiva- 
lent to running CDA one time. Note that the average run- 
ning time of CDA itself is about 8 seconds. This means that 
GALE (33 evals) can optimize the CDA model in about 4 
minutes versus the 7 hours needed by NSGA-II (4000 evals) 
or SPEA2 (3200 evals). That is, GALE is about 60x faster. 
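
For a rough sense of scale, the simulation time implied by the evaluation counts can be computed directly. The sketch below counts only the time spent running CDA at the quoted average of about 8 seconds per evaluation and ignores any optimizer overhead; it does not try to reproduce the quoted 7 hours or 60x figures, which may reflect measured wall-clock times or rounding. The names are ours.

```python
SECONDS_PER_EVAL = 8  # average CDA running time quoted above

EVALS = {"GALE": 33, "NSGA-II": 4000, "SPEA2": 3200}


def simulation_minutes(evals, seconds_per_eval=SECONDS_PER_EVAL):
    """Minutes spent running CDA alone, ignoring optimizer overhead."""
    return evals * seconds_per_eval / 60


for name, evals in EVALS.items():
    ratio = evals / EVALS["GALE"]
    print(f"{name}: {simulation_minutes(evals):.1f} min, "
          f"{ratio:.0f}x the evaluations of GALE")
```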

Please note that these results are from running each algo- 
rithm only once each. A more complete study is in progress, 
but due to the recent government shutdown, we were unable 
to complete our goal of 20 runs of NSGA-II and SPEA2. 
However, in keeping with the main message of this paper, 
we note that we could finish all the GALE runs (these are 
not shown since it makes little sense to compare solo runs
with multiple runs).

Conclusion 

In this paper, we’ve shown that GALE can learn the ‘wicked’ 
response surface for an aerospace task management model 
at similar accuracy and much faster than other similar 
techniques. Optimization (GALE), explanation (visualiza- 
tions and charts), and encapsulation (ruleset summarization) 
are tools that together comprise the validation of models. 
GALE’s fast learning allows us to more thoroughly explore 
the envelope of behaviors, leading to overall improved vali- 
dation. 

Our immediate next steps involve the thorough data col- 
lection of experiments described in this paper, since only the 
results for N=1 runs of each of NSGA-II, SPEA2 and GALE
were detailed. Our further plans are then to improve GALE’s 
reporting suite on the learned results; we’d like to generate 
succinct rulesets on how best to build test cases with optimal 
solutions. This will include both ruleset summarization and 
also validity assertions through re-evaluating the model on 
individuals generated via the ruleset. Such a ruleset can be 
used to validate the CDA model itself. In the longer term, we 
intend to expand the analysis to include more complex and 
realistic scenarios including a larger number of input parameters
evaluated against more output metrics.